Milk yield prediction in Friesian cows using linear and flexible discriminant analysis under assumptions violations

Background The application of novel technologies is now widely used to assist in making optimal decisions. This study aimed to evaluate the performance of linear discriminant analysis (LDA) and flexible discriminant analysis (FDA) in classifying and predicting Friesian cattle’s milk production into low (\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\:<$$\end{document}4500 kg), medium (4500–7500 kg), and high (\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\:>$$\end{document}7500 kg) categories. A total of 3793 lactation records from cows calved between 2009 and 2020 were collected to examine some predictors such as age at first calving (AFC), lactation order (LO), days open (DO), days in milk (DIM), dry period (DP), calving season (CFS), 305-day milk yield (305-MY), calving interval (CI), and total breeding per conception (TBRD). Results The comparison between LDA and FDA models was based on the significance of coefficients, total accuracy, sensitivity, precision, and F1-score. The LDA results revealed that DIM and 305-MY were the significant (P < 0.001) contributors for data classification, while the FDA was a lactation order. Classification accuracy results showed that the FDA model performed better than the LDA model in expressing accuracies of correctly classified cases as well as overall classification accuracy of milk yield. The FDA model outperformed LDA in both accuracy and F1-score. It achieved an accuracy of 82% compared to LDA’s 71%. Similarly, the F1-score improved from a range of 0.667 to 0.79 for LDA to a higher range of 0.81 to 0.83 for FDA. Conclusion The findings of this study demonstrated that FDA was more resistant than LDA in case of assumption violations. Furthermore, the current study showed the feasibility and efficacy of LDA and FDA in interpreting and predicting livestock datasets.


Background
Screening milk yield by investigating factors affecting production is critical for breeders.Large datasets, particularly for all animal parities, increase the challenge of prediction due to high dimensions.So, selecting the appropriate statistical method is crucial for analyzing biomedical data.Inappropriate methodological choices can lead to misleading interpretations and unsubstantiated conclusions.Understanding the assumptions and conditions of statistical methods is essential for selecting the appropriate method for data analysis.By doing this, researchers can ensure accurate and reliable data interpretation.Along with knowledge of statistical technologies, an additional important consideration is data type and nature, as well as the study's goal [1].In the case of categorical dependent variables, Discriminant analysis is primarily used to categorize data into two or more groups.
Discriminant analysis focuses on identifying significant factors for group distinction and categorizing unlabeled items.It involves two stages: separation and allocation.Separation aims to establish discriminant functions to separate groups, while allocation uses these functions to allocate unclassified items to recognized categories [2].Additionally, it is carried out by computing the weights for each explanatory variable, which are calculated as between-group variance in relation to within-group variance [3].Discriminant analysis is critical in determining the factors influencing milk production [4][5][6].
Various discriminant functions have been investigated over time, but they all serve the same purpose.Specific discriminant functions are required in certain circumstances.Linear discriminant analysis, Quadratic discriminant analysis, and Regularized discriminant analysis are examples of parametric discriminant analysis, while Flexible discriminant analysis and K-Nearest Neighbor analysis are examples of nonparametric discriminant analysis [7].Linear methods are commonly used but may not always be sufficient in certain cases for accurate classification and error reduction.So, selecting an alternative approach can improve classification accuracy.
The flexible discriminant model (FDA) is a generalized linear discriminant analysis (LDA) method that can create non-linear boundaries for classes.The FDA's inspiration originates from the connection between LDA, canonical correlation analysis, as well as optimal scoring.The primary difference is that the FDA can incorporate quadratic and bilinear predictors as well as nonlinear decision boundaries [8].FDA is a valuable data prediction and classification method, and it is expected to draw the interest of medical investigators [9].
The current study aims to predict Friesian cow milk yield using linear discriminant analysis, use a more flexible classification model (flexible discriminant model) to accommodate the assumptions violation, and compare the performance of both models in anticipating and categorizing milk yield.

Data collection and management
A total of 3793 dairy records were obtained for this study to investigate the significant variables influencing milk production in Friesian cows raised in Egypt.The data were gathered from one of the large farms in the El-Dakahlia governorate of Egypt.This investigation's materials included records of dairy cows that calved between 2009 and 2020.Cows were mechanically milked three times per day under an automatic milking system.The daily milk yield was estimated using the sum of these three values.This study included cows that had completed at least one lactation.The lactation records and all other data are computer-based and updated regularly.On the farms, all cows were occasionally kept on unclean floors, in open yards, or slenderly covered yards with access to cool spraying during the summer.Animals were fed rations with a crude protein content of about 18%, according to National Research Council [10].Isolation units were provided for sick and suspected animals and calving units were used for parturition.Storage units were also provided for feed and equipment.The calfrearing system was an indoor system, allowing calves to stay with their dams and suckle for a limited period each day.

Investigated traits
The research's data were used to discover the major factors affecting milk yield in Friesians using linear and flexible discriminant models.As a result, the variables under consideration were actual total milk yield (kg), which was divided into three categories: low (< 4500 kg), medium (4500-7500 kg), and high (> 7500 kg), and considered the dependent variable of interest, while the rest variables considered as predictors include age at first calving (month), lactation order of eight parities (P1-P8), days open (days), dry period (days), days in milk (days), 305day milk yield (kg), calving season (spring -summerautumn-winter), calving interval (month), and number of breeding per conception (number).Furthermore, the categorization of milk yield variable and the selection of independent variables have been carried out based on previous authors' recommendations [11][12][13][14][15][16].

Data preprocessing
Data were initially checked for missing values as well as outliers.The data showed an imbalance in the number of the three levels of the outcome variable, which ranged between 3485 for the highly-producing cows, 287 for the medium class, and 21 for the low-producing category.The data was split randomly into two parts: 80% of the data was the training set for model creation and adjustment, and the second part was the testing set, representing 20% of the data for model validation and performance.The train set was processed by random over sampling (ROS) "up sampling" using the Caret package.The chi-square test and multinomial logistic regression were used to select variables for multivariate analysis, with all variables showing high significance levels (p < 0.001).For the LDA, the values of the different parameters were also centered and scaled to minimize data variability.Linear and flexible discriminant analysis models were used to classify cow milk yield into high, moderate, and low categories, creating a powerful model for future prediction.

Linear discriminant analysis
Linear Discriminant Analysis is a parametric method used to identify the weights of independent variables that best distinguish between two or more mutually exclusive and exhaustive groups of cases.LDA is also a hard classification approach.Hard classifiers only define a hard classification boundary for each group based on the explanatory variables.A new item is assigned to a group based on characteristics that fit within the group's characteristics [17].Each group has a mean vector that is calculated using the attribute vectors of all the objects in the group.A test observation is assigned to the class whose centroid is the closest, where distance can be expressed as a Mahalanobis metric using the pooled within-group covariance matrix [18].LDA functions are orthogonal to previous ones, and a number of discriminant functions of either G-1 or J, whichever is less, where G represents the categories number of the dependent variable and J represents the exploratory variables number [19].The linear discriminant model is provided below: Where: D jk is the discriminant score of discriminant function j for the milk yield level k, b o is a constant term, b i is the i th predictor's discriminant coefficient, and x ik represent the i th explanatory variable (AFC, LO, DO, DIM, DP, CFS, 305-MY, CI, TBRD) for the milk yield level k.
The function generates discriminant scores, which estimate predicted probabilities for each categorical outcome variable.These scores, along with group means (centroids), aid in classifying cases into groups.A variable's significance in explaining an outcome is indicated by large coefficients [20].LDA has limitations since great accuracy can be achieved with strong multivariate normality assumptions and equal covariance matrices.However, these assumptions are uncommon in practice [2].

Flexible discriminant analysis
Flexible discriminant analysis is a non-parametric generalization of linear discriminant analysis.It can be regarded as a technique of postprocessing a multi-outcome regression with an optimal scoring approach.The FDA procedure is similar to LDA, replacing linear regression with non-parametric or semi-parametric regression, allowing for various regression tools and discrimination rules.This enable a more flexible class boundaries and better discrimination [21,22].FDA was used to develop a coordinate system capable of distinguishing between all three levels of milk yield.To maximize separation between the data points of various milk yield types and to minimize variation among the data points within each milk production type, the data were mapped onto a twodimensional plane.To determine the decision boundary, a regression is run, with the parameters set to reduce the average-squared residual (ASR) [22].
Where the function θ allocates scores for the most accurate prediction of transformed class labels, N denotes the number of observations, K denotes the number of categories, η is the regression fit using nonparametric regression techniques, and L denotes the regularization penalty function used to penalize roughness in the regression fit, with a weighting determined by parameter λ.The FDA permits the use of nonparametric regression techniques, including multivariate additive splines (MARS) [23], neural networks [24], and generalized additive models [22] in model fitting.MARS is a spline regression that substitutes basis functions for the initial data as predictors.
The MARS basis function change permits to specifically blank out regions of a variable by setting them to zero.This enables MARS to concentrate on specific subregions of the data, obtain optimal variables and interactions, and handle complex data structures with high dimensions [25].
In brief, the calculations are as follows: 1. Multivariate nonparametric regression: provide fitted values Y for a multiresponse variable, adaptive nonparametric regression of Y on X.S λ is the linear operator that best fits the final model, and η * is the vector of fitted regression functions, shall be considered.

Optimal scores: calculate the eigen-decomposition
of Y T Y = Y T S λ Y , in which the eigenvectors Θ are normalized, using the formula: A diagonal matrix of the estimated class prior probabilities is calculated as D π = Y T Y /N 3. Update the statistical model from Step 1 by applying the optimal scores: η (x) = Θ T η * (x) .

Assumptions
The choice between LDA and FDA is more dependent on assumptions beyond each approach.In theory, the FDA is more flexible in terms of assumptions, especially variable distributions, and outliers.However, they both share some assumptions, such as observation independence and the absence of multicollinearity between predictors in datasets [26].Hence, LDA required the testing of certain fundamental assumptions before its application.The independent variables were examined for multivariate normality, so, the predictors should be ratio or interval levels.Mahalanobis distance against chi-square was used to determine the presence of outliers.A simple scatter plot was used to assess the data's linearity assumptions between the predictor and response variables [27].The homogeneity of covariance matrices among dependent variable levels was verified using Box's M test [28] while the variance inflation factor (VIF) was used to examine the multicollinearity of suggested discriminators; a VIF value of 1 indicates no correlation, a VIF value between 1 and 5 indicates moderate correlation, and a VIF value greater than 5 indicates strongly correlated [29].

Classification model evaluation
According to Sokolova and Lapalme [30], the prediction model is evaluated using overall classification accuracy (ACC), which measures the ratio of correctly classified observations to total observations.Researchers often use other metrics like sensitivity (SEN) and precision (PREC) to measure the accuracy of classifiers.SEN is calculated as (TP / (TP + FN)), which is the percentage of target class labels correctly classified as desired [31,32], while PREC is calculated as (TP / (TP + FP)), which is the proportion of correctly classified examples.TP represents true positives predicted by the classifier, TN represents true negatives, FP represents false positives, and FN represents false negatives [33,34].An additional metric is known as the F1-score, which is regarded as the harmonic mean of SEN and PREC.It examines the precision and sensitivity of the model in identifying positive cases (PREC) as well as its ability to prevent missing positive examples (SEN).The F1-score is defined as follows: F1 − score = ( 2 * SEN * P REC) / (SEN + P REC) [30].

Software and packages used
Linear discriminant analysis was performed using the MASS package while FDA using the MARS method was performed using the "fda" function from the "mda" package.All analyses were carried out using the R programming language [35] and SPSS version 25.

Tests of assumptions
First, the study checked the assumptions of LDA.There are no missing values in the data.Box plot indicated that the data contains univariate outliers in the majority of the explanatory variables as illustrated in Fig. 1.Whereas multivariate outlier based on Mahalanobis distance revealed that there are many outliers, as demonstrated in Fig. 2 below.The homogeneity of covariance matrices was tested using the Box's M statistic, which showed the violation of that assumption (Box s M = 342, and p < 0.001 ).Results indicated that the univariate normality of independent variables was violated while multivariate normality was met by the Henze-Zirkler test (Henze − Zirkler value = 106.95,p < 0.001).
Nonetheless, the analysis was carried out in violation of this assumption as the discriminant analysis exhibits good results in face and object recognition, even though the normality assumption is frequently violated [36].Scatter plots tested the linearity assumption, showing a linear relationship between the outcome variable and only a few independent variables.The VIF test was used, and there is a moderate degree of multicollinearity as VIF between 2 to 5 for three parameters (DO= 5.84, DIM= 3.95, and CI=2.36).As VIF does not indicate which pair of predictors are correlated, a correlation analysis was performed to determine which parameters are correlated.The results revealed that days open have a significant relationship with days in milk (r = 0.85), and days open are also highly correlated with calving interval (r = 0.76).As a result, we dropped the DO variable to eliminate multicollinearity.

Linear discriminant analysis
The LDA model calculates group means and the probability of belonging to the different categories for each individual.In other words, the average of each explanatory variable within each category.These values could suggest that the DIM may have a greater influence on high milk production (61.6) than on medium (23.6) and low (37.5)milk production, as was also seen with 305-MY.The model's group means indicate good separation ability between milk production levels (high, medium, and low), as displayed in Table (1).
The study found that all independent variables significantly contribute to data classification, indicating a significant difference between milk production levels.The Wilks' lambda value decreases as the independent variable becomes more significant in the discrimination   2), the most significant variables were 305-day milk yield (γ = 0.610), Days in milk (γ = 0.909), and second parity (γ = 0.930).Table (3) displays the model's coefficients, which are linear combinations of predictor variables used to create the LDA decision rule.Standardized canonical discriminant coefficient was used to rank the significance of each variable.A high standardized discriminant function coefficient indicates significant differences between groups on a particular variable, so the first three parities are considered the most significant factors based on these coefficients.This result contradicts Wilk's test results for the significance of predictors observed in Table (2), which show the impact of outliers and collinearity on the data.Therefore, we focused on the structure matrix as its results matched Wilk's test's rank of predictor importance.According to the factor loadings, 305-MY and DIM are important parameters of discrimination on LD 1 , whereas the first and second parities are on LD 2 .
The canonical discriminant function is built using unstandardized coefficients:    There were two functions created: the first had a higher eigenvalue of 0.841 and explained 84.6% of the total variance in the outcome (milk yield level), and the second had an eigenvalue of 0.153 and explained 15.4% of the total variance in the outcome variable (Table 4).Eigenvalue is a measure of the variance that a function can explain.R 2 (Coefficient of multiple determination) is defined as the square of the canonical correlation value.The first discriminant function had mean scores for low milk production of 0.997, medium and high levels of 0.216 and − 1.217, respectively, while the second discriminant function had mean scores of -0.352 for low milk production and 0.546 and − 0.192 for medium and high levels (Table 5).
The LDA model had a good discriminating accuracy of 71% in both the training and testing sets, which reveals that 71% of the milk production was properly classified by the model as displayed in Table (6).By 79%, 63%, and 71%, respectively, the model classifies high, medium, and low cases properly.Only 78.38% of all high cases have been identified by the model, 72.52% of all medium cases were detected by the model, and only 62.58% of all low cases were identified.
The discriminant function plot revealed that the low level had the highest value on function I (Fig. 3).Function II played a minor role in the discrimination process compared to function I. Three milk yield levels membership appeared on the graph at nearly the same level, indicating poor discrimination between groups.

Flexible discriminant analysis
Linear discriminant analysis requires a lot of assumptions to be held.As the results showed a violation of all assumptions, we applied FDA which tolerated the nonlinearity of the data.Regarding multivariate normality, this assumption is rarely met.FDA using the MARS method was used, and the model discriminated the data by two dimensions with a misclassification error of 0.17.The first dimension accounted for 67.4% of the betweengroup variance, and by the second dimension totally reached 100%.
Table 7 displays the flexible discriminant analysis coefficients.The results revealed that the first and second parities are important parameters of discrimination on FD 1 , whereas the sixth and seventh parities are on FD 2 .
The canonical discriminant function is built using unstandardized coefficients: Similarly,   The class score means were presented in Table (8), with clear separation in the first dimension, however in the second dimension, the high and low categories are quite near.Overall, the separation appears to be satisfactory as seen in Fig ( 4).
The FDA model showed a superior discriminating accuracy of 82.45% and 83.19% in the training and testing sets, respectively, indicating accurate identification of milk production, as shown in Table (9).In the training set, the model correctly classifies high, medium, and low instances by 82.71%, 87.16%, and 77.39%, respectively.The model identified only 83.43% of all high cases, 75.90% of all medium cases, and 90.20% of all low cases.

Comparison of classification techniques
The suggested LDA and FDA model's robustness is validated through classification performance measures, each metric is reported for the three classes (high, moderate, and low).Notably, the FDA model achieved higher sensitivity (0.83, 0.87, and 0.77), precision (0.83, 0.759, and 0.90), and F1 score (0.83, 0.81, and 0.83) compared to LDA achieved the highest metrics for the (high) class (0.79, 0.78, and 0.79) for sensitivity, precision, and F1 score, respectively.Additionally, the FDA model exhibited a higher overall classification accuracy of 82% compared to LDA's 71% (Tables 6 and 9).The model's predictive power is low, and to improve results, the days open variable was removed.The findings, however, demonstrated that the assumptions were invalid.As a result, flexible discriminant analysis is created.
The evaluation on the test set revealed a clear advantage for the FDA model in terms of accuracy (82.45 − 83.6%) compared to LDA (71.09 − 76.0%).This difference suggests that LDA may be overfitting the training data, leading to lower generalizability on unseen data.While LDA achieved reasonable accuracy for the "high" class, its performance suffered for the "moderate" and "low" classes.This suggests an inability to accurately predict these categories.Conversely, the FDA model demonstrated good to high sensitivity across all classes (0.85, 0.67, and 0.67 for high, moderate, and low, respectively).While precision and F1-score for FDA decreased for moderate and low classes compared to the high class, they remained superior to LDA's performance for all classes.
Our study aimed to evaluate the impact of assumption violations on model performance.However, we observed Fig. 3 Linear discriminant analysis biplot revealed that the discrimination between group means isn't good as group means are so close that both models, particularly LDA, struggled to predict some classes due to the inherent class imbalance in the data.To isolate the effect of assumption violation and ensure a fair comparison, we re-evaluated the models using a balanced test set.This step aimed to mitigate any potential confounding effect of class imbalance on our results and ensure our choice of model was based solely on the impact of assumption violations.The re-evaluated results, considering the balanced test set, further confirmed the superiority of the FDA model.

Discussion
The current study classified and predicted Friesian milk yield using two classification methods, LDA and FDA.Based on the literature review, no similar studies were observed that implement and investigate LDA assumptions violations and compares performance with the FDA in prediction and classification.Based on past studies, the accurate classification method can vary depending on the data structure.This study highlights the importance of considering data characteristics and model assumptions when selecting classification methods for biological data analysis.El-Bayomi et al. [37] emphasize the value of choosing appropriate statistical techniques to ensure reliable and accurate outcomes.
The data initially displayed a severe imbalance, addressed through up-sampling, as recommended by Hassan et al. [38].We opted against Synthetic Minority Over-sampling Technique (SMOTE) due to its potential drawbacks for sparse minority classes [39] and using SMOTE can disrupt the class separation that discriminant analysis relies on [40].
The LDA model exhibits violation of all assumptions and data contains outliers, even though the multivariate normality assumption is rarely met but still all other assumptions are violated.Therefore, we used another adaptable model that doesn't necessitate linearity, and the results showed that FDA was superior to LDA in prediction and discrimination, which is supported by Moisen and Frescino [41], who noticed that the MARS model performed more accurately for predicting forest aspects than linear models.
If the data set is large and assumptions are violated, the FDA outperforms the LDA.Multidimensional data with outliers and multicollinearity can make it challenging for the LDA model performance to be applied.Application of machine learning algorithms such as artificial neural networks, random forests, and other machine learning models can improve results as stated by Gouda et al. [42], who reported that machine learning can handle both multidimensional and small datasets.
Selecting appropriate evaluation metrics is crucial and challenging, especially for imbalanced data.Traditional metrics like accuracy can be misleading in such scenarios Sokolova and Lapalme [30].A model might achieve high accuracy by simply predicting the majority class most of the time, but this wouldn't reflect its ability to correctly classify the minority class.Additionally, the Kappa value is inappropriate evaluation metric due to its potential bias and inaccuracy, as highlighted by previous study [43].To address this challenge, we employed precision, recall, and F1 score as primary evaluation metrics.These metrics are well-suited for imbalanced datasets providing a more comprehensive picture of model performance [44].
Previous studies assessed LDA and FDA's effectiveness using various criteria.As stated by Solberg [18], the FDA is a less computationally intensive algorithm for non-Gaussian feature classification without advanced parameter specifications.Öztürk and Özdamar [9] stated that FDA performed better than LDA.Furthermore, they   concluded that FDA methods can be applied to nonnormal and heterogeneous data sets, while LDA should be applied to normal and homogeneous data sets.LDA is effective for classification but not for problems with non-normal distributions, according to Fukunaga [45].Naghibi et al. [46] found that LDA is sensitive to outliers and cannot ensure a linear combination between the explained variable and other explanatory variables.So, the FDA exhibited the best performance among discriminant analysis models.However, more guidelines are required to choose between these methods properly.According to LDA, the most important predictors in the discrimination process are DIM, 305-MY.This result is in agreement with the outcomes reported by Vijayakumar et al. [47] and Akkuş et al. [12] who stated that the amount of milk produced by Holstein cows was affected by lactation length.Some studies reported a negative correlation between DIM and milk yield [48,49], others Mellado et al. [50] stated that there is low correlation between the total milk production and the 305-MY.While parities are a major factor in the FDA's discriminating process, this result is consistent with Munim et al. [51], M'hamdi et al. [52], and Moawed and Abd El-Aziz [16] who observed a significant (P < 0.05) influence of par- ity on milk output.Nevertheless, the findings contrasted with those of Habib et al. [53], who indicated that lactation number did not affect milk output (P > 0.05).

Conclusion
Linear discriminant analysis and flexible discriminant analysis are both suitable for classifying dairy cow data and yielding useful information, but their assumptions and methodology differ.Linear discriminant analysis requires more assumptions about underlying data, while the FDA doesn't.Flexible discriminant analysis is more robust in cases of assumption violations.Overall, the FDA is a more powerful and adaptable classification method than LDA.However, the computing cost is higher, and the findings may be more challenging to understand.

Fig. 4
Fig. 4 Flexible discriminant analysis's discrimination biplot.The plot showed remarkable differentiation between the three categories of milk production.(1) The color orange stands for a high milk production.(2) The color blue stands for a medium category.(3) The color green stands for a low category

Table 1
Group means for each predictor within each class using LDA

Table 2
Predictors' role in explaining outcome using linear discriminant analysis **Coefficient is significant at a 0.001 level of significance (P < 0.001) *Coefficient is significant at a 0.05 level of significance (P < 0.05)

Table 3
Standardized and unstandardized discriminant coefficients for the predictors

Table 4
Eigen values and Wilks' Lambda for linear discriminant functions

Table 5
Group centroids of low, medium and high-level milk yield by LDA

Table 6
LDA evaluation metrics for the model performance on train and test sets

Table 7
Coefficients of the model for the first two dimensions of flexible discriminant analysis

Table 8
Group centroids of low, medium and high-level milk yield by FDA